feat(decide): pure DECIDE core + exhaustive truth-table tests#4
Conversation
…tests Replace the stubbed deciders in domain/decide with real, total, side-effect-free implementations: - ResolveProbeDecision: kill-intent short-circuits to terminal; failed probes and probe disagreement route to detecting; only runtime-dead + process-dead + no-recent-activity concludes killed. - ResolveOpenPRDecision: the PR pipeline ladder (ci_failing > changes_requested > approved+mergeable > approved > review_pending > idle-beyond > open). - ResolveTerminalPRStateDecision: merged -> idle/merged_waiting_decision, closed -> idle. - CreateDetectingDecision: anti-flap quarantine — unchanged-evidence counter with StartedAt preserved across the episode so the duration cap is a real wall-clock safety net; escalates to stuck at 3 ticks or 5m. - HashEvidence: strips timestamps/epochs and collapses whitespace before hashing so restamped-but-unchanged signals compare equal. Table tests cover every branch (100% statement coverage), including a consistency check that the open-PR ladder's display Status matches DeriveLegacyStatus over the canonical state it emits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
harshitsinghbhandari
left a comment
There was a problem hiding this comment.
Review — pure DECIDE core
Recommendation: Approve, with follow-ups. Verified independently in a throwaway worktree: gofmt clean, go build/vet pass, go test -race green at 100% statement coverage. The core logic is correct, pure, and exhaustively tested. Findings below are about the seam with the (not-yet-built) LCM, one semantic gap, and minor items — none are correctness bugs in the isolated functions.
The two areas called out in the request — both correct
- Probe branch ordering ✓
kill → any-probe-failed → runtime switch. A failed probe never concludes death; disagreements and indeterminate-process quarantine; onlyruntime-dead + process-dead + no-recent-activityreacheskilled. - Detecting counter/duration ✓ Preserving
StartedAtacross an evidence change while resetting the attempt counter is the right call — the 5-min cap is a whole-episode wall-clock safety net, not per-evidence, so a flapping-but-stuck session still escalates. The "duration cap fires even when evidence keeps flapping" test proves it.
Architecture & design (carry into the LCM PR)
- Probe decider conflates liveness with session activity state. A healthy runtime returns
SessionWorking / ReasonTaskInProgress. When the reaper feeds this throughApplyRuntimeObservation, the LCM must not let a liveness verdict clobber a more specific activity-derived state (e.g. a session inwaiting_inputset byApplyActivitySignal). Needs an explicit composition rule for which decider ownsSessionStatewhen the runtime is alive. Not a bug here — a contract to document in the LCM PR. mergeableis unreachable withoutApproved. The distilled ladder is "approved /none + mergeable → MERGEABLE", but the impl only reachesMERGEABLEonApproved && Mergeable. A mergeable PR on a repo with no required review falls through toPR_OPEN. EitherOpenPRInputshould express "no review required" orMergeableshould be weighed independently. Confirm intent before the LCM builds theSCMFacts → OpenPRInputmapping.
Integration contract (LCM PR)
- Detecting-memory lifecycle. The deciders correctly return
Detecting == nilfor every non-detecting outcome and non-nil only while quarantined. The LCM must therefore clear the persisted detecting memory (viaLifecyclePatch.ClearDetecting) whenever a decision'sDetectingis nil — otherwise a staleStartedAt/EvidenceHashleaks into a later episode.
Testing
- Extend the consistency cross-check.
TestResolveOpenPRDecisionDerivesConsistently(decision.Status == DeriveLegacyStatus(canonical)) is excellent but only covers the open-PR ladder. The probe and terminal deciders are also consistent (hand-verified) — adding them to the same test locks the invariant against drift.
Code quality (minor)
HashEvidenceepoch regex\b\d{10,13}\bstrips any 10–13 digit run, which would also clobber a legitimate large ID if one ever appears in an evidence string. Low risk today; worth a comment or tightening. (HH:MMwithout seconds also isn't stripped — fine for the current full-timestamp format.)
Nice work — the truth-table is thorough and the anti-flap semantics are right.
…ency test Address PR review (aa-1): - ResolveOpenPRDecision now keys MERGEABLE on Mergeable alone (checked before Approved). Mergeability is the authoritative merge gate, so a PR on a no-required-review repo no longer falls through to PR_OPEN. The approved+mergeable and approved-only cases are unchanged. - Broaden the derive-consistency test to cover the probe and terminal deciders too, not just the open-PR ladder. - Document the HashEvidence epoch-stripping regex's breadth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review, @harshitsinghbhandari. Addressed in 9920c6d: #2 — mergeable unreachable without #4 — extend the consistency cross-check ✅ Done. Renamed to #5 — #1 — probe decider conflates liveness with activity state — Agreed, deferring to the LCM PR. This is a composition-rule question (which decider owns #3 — detecting-memory lifecycle — Agreed, LCM-PR contract. The deciders already return Gates still green: gofmt clean, build/vet/ |
harshitsinghbhandari
left a comment
There was a problem hiding this comment.
Re-review — 9920c6d
Verified independently (throwaway worktree on 9920c6d): gofmt clean, build/vet pass, go test -race green at 100% coverage. All actionable findings resolved:
- #2 mergeable-without-approval — correct. Moving
Mergeableahead ofApprovedbut still behindCIFailing/ChangesRequestedis a genuine strict superset: every previously-MERGEABLEcase is unchanged, the no-required-review case is now surfaced, and CI-fail / changes-requested precedence is preserved. The "authoritative merge gate" comment captures the rationale. - #4 consistency cross-check — now spans the probe + terminal deciders (
TestDecidersDeriveConsistently), with the terminal none/open no-op reasonably excluded. This locks decider output againstDeriveLegacyStatusdrift. - #5 epoch-regex breadth — documented.
#1 (liveness-vs-activity composition) and #3 (detecting-memory clearing) acknowledged as LCM-PR contract follow-ups — agreed, those belong there.
LGTM. Good, fast turnaround.
…ting helper Address PR review (aa-1): - Move the decider input/output type definitions (LifecycleDecision, ProbeInput, ProcessLiveness, OpenPRInput, DetectingInput) out of the execution file into a dedicated types.go, leaving decide.go for behaviour. - Expand the detecting() helper doc to spell out that it adapts a probe verdict into the shared CreateDetectingDecision anti-flap path so each probe branch doesn't re-implement the quarantine counter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clarify the timestamp-stripping block flagged in review: spell out what each of the three regexes matches (full ISO/RFC3339 datetime, bare time-of-day, bare unix epoch) and why order matters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR implements the previously-stubbed, pure DECIDE-core logic for the backend lifecycle lane (LCM/SM), and adds comprehensive table-driven tests to lock in the decision truth tables.
Changes:
- Implemented deterministic deciders for probe reconciliation, PR ladder resolution, terminal PR handling, detecting→stuck escalation, and evidence hashing.
- Added evidence hashing normalization (timestamp stripping + whitespace collapse) to support stable detecting escalation.
- Added exhaustive table tests plus consistency checks against
domain.DeriveLegacyStatus.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
backend/internal/domain/decide/types.go |
Defines the DECIDE-core input/output shapes used by all deciders. |
backend/internal/domain/decide/decide.go |
Implements the decider logic, detecting escalation, and evidence hashing. |
backend/internal/domain/decide/decide_test.go |
Adds table-driven tests covering all decision branches and consistency invariants. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… contract Address Copilot review: - ResolveOpenPRDecision now sets a stable, timestamp-free Evidence summary "<condition> #<num> <url>" for every ladder outcome, consuming the previously-unused OpenPRInput.Number/URL identity inputs and making PR decisions traceable in logs. Covered by TestResolveOpenPRDecisionEvidence. - Document LifecycleDecision's zero-value contract: an empty PRState/PRReason means "this decider does not address PR — leave unchanged", not PRNone. The LCM must map empty PR fields to a nil LifecyclePatch.PR; writing PRNone on a probe tick would clobber a live PR. (Pointers were considered but the empty sentinel is distinguishable from every valid state and consistent with the codebase's value-enum style; LifecyclePatch already owns nil-means-unchanged.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…change Defer findings #4 (isGitRepo case-sensitivity) and #5 (GitChecker seam) out of this PR. Restores the original exec-based isGitRepo and the New(store) constructor; removes git.go, git_test.go, and the test-only export shims. The error-standardization and other findings are unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ayers (#95) (#96) * refactor(backend): LLD maintainability fixes in controllers/service layers Addresses the high + medium severity findings from the LLD review of backend/internal/httpd and backend/internal/service (#95): 1. Controllers no longer import internal/session_manager. Session sentinel errors are now *domain.ServiceError values carrying their own HTTP mapping, so the controller translates them with one generic errors.As — no cross-package sentinel imports. 2. One error pattern across services: project.Error is now an alias of the shared domain.ServiceError, and session_manager sentinels use it too. A single writeServiceError replaces the per-resource error switches. 3. Clean-orchestrator business logic moved out of the controller into session.Service.SpawnOrchestrator(ctx, projectID, clean). 4. isGitRepo no longer treats case-different paths as equal on case-sensitive filesystems; case-insensitive compare is gated to darwin/windows via samePath. 5. Project repo check sits behind an injectable GitChecker, so the service is testable without a real git binary. 6. httpd exports only the production constructors (NewWithDeps, NewRouterWithControl); the 3 test-only wrappers are removed and the "router with empty deps" convenience moved to an unexported test helper. Closes #95 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * refactor(backend): standardize service errors on internal/httpd/errors Replace the domain.ServiceError approach with a REST-API-scoped error package and a single envelope renderer, per review feedback: - Add internal/httpd/errors (package errors, aliased apierr): one structured Error type with semantic Kinds (Internal/Invalid/NotFound/Conflict) and constructors. Imports nothing, so any layer can depend on it. - envelope.WriteError is now the single path from a service error to the wire APIError, and the only place a Kind becomes an HTTP status/word. The per-resource writeProjectError/writeSessionError translators are gone. - Delete domain/errors.go (keeps domain pure of HTTP-flavored kinds) and service/project/errors.go (no per-service error files); services build errors inline via apierr constructors. - session_manager sentinels are apierr.Error values (pointer identity still works with errors.Is). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * revert(backend): drop GitChecker seam and isGitRepo case-sensitivity change Defer findings #4 (isGitRepo case-sensitivity) and #5 (GitChecker seam) out of this PR. Restores the original exec-based isGitRepo and the New(store) constructor; removes git.go, git_test.go, and the test-only export shims. The error-standardization and other findings are unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * refactor(session): translate engine errors to API errors at the facade The session_manager is the internal command engine and must not depend on the REST API error vocabulary. Revert its sentinels to plain errors.New values and move the engine→API translation into the service/session facade (toAPIError), which is the correct boundary. Controllers still see apierr.Error and never import the engine; the engine no longer imports internal/httpd/errors. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(session): tighten error comments to state what the code does Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * style(envelope): make KindInternal an explicit case in httpStatus Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * refactor(apierr): rename package, test SpawnOrchestrator, parity fixes Address review feedback on PR #96: - Rename internal/httpd/errors → internal/httpd/apierr (package apierr) so importers no longer alias around the stdlib errors package. - Add a commander seam to session.Service and unit-test the relocated clean-orchestrator rule: clean=true kills all active orchestrators before spawning; clean=false spawns without kills. - project.Add: wrap the UpsertProject store error in apierr.Internal for parity with its sibling paths (was a raw 500). - Document that KindInternal is iota's zero value, so a zero-value Error defaults to 500. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…viewer to worker harness Per PR #197 review feedback: - Reviewer agent posts its review to the PR itself, so remove the ports.PRReviewPoster port, the GitHub review poster, the submit HTTP route + DTO, and the service Submit method (#1, #4, #7). - Trigger spawns the reviewer agent over the worker's worktree with its own review prompt, mirroring the session launch flow (resolve agent by harness -> argv -> runtime.Create) (#8, #9). - Default reviewer harness reuses the worker's harness when supported, falling back to claude-code; reviewer config stays independent of the worker override (#5, #6). - Drop the `ao review` CLI for this PR's scope (#2, #3). Regenerated OpenAPI + TS types. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(review): configurable AO code review backend (V1)
Add per-project configurable code review of a worker's PR. A reviewer
agent runs one-shot over the worker's own worktree and posts its result
to the PR; the worker picks the feedback up through the existing SCM
observer review-nudge path.
- domain: ProjectConfig.reviewers (+ default reviewer harness), Review /
ReviewRun types and verdict/status vocab.
- storage: review + review_run tables (0011), sqlc queries, store methods.
- service/review: rewrite the in-memory stub as a persisted ReviewService
(Trigger/Submit/List) with a reviewer Runner over agent resolver +
runtime; ports.PRReviewPoster implemented on the GitHub adapter.
- http: session-scoped routes POST /sessions/{id}/reviews/trigger,
POST .../submit, GET .../reviews; regenerated OpenAPI + TS types.
- cli: ao review trigger|submit|list.
- frontend: adapt ReviewDashboard to the per-worker reviews API.
Closes #192
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(review): address review — drop submit/poster/CLI, default reviewer to worker harness
Per PR #197 review feedback:
- Reviewer agent posts its review to the PR itself, so remove the
ports.PRReviewPoster port, the GitHub review poster, the submit HTTP
route + DTO, and the service Submit method (#1, #4, #7).
- Trigger spawns the reviewer agent over the worker's worktree with its
own review prompt, mirroring the session launch flow (resolve agent by
harness -> argv -> runtime.Create) (#8, #9).
- Default reviewer harness reuses the worker's harness when supported,
falling back to claude-code; reviewer config stays independent of the
worker override (#5, #6).
- Drop the `ao review` CLI for this PR's scope (#2, #3).
Regenerated OpenAPI + TS types.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(review): restore ao review submit (records verdict+body in AO)
Per maintainer request, bring back `ao review submit`. AO records the
reviewer's verdict and body on the review_run and marks the pass complete;
it does not post to GitHub — the reviewer agent posts its review to the PR
itself.
- storage: add review_run.body (0011), persist via Insert/UpdateReviewRunResult.
- service: restore Submit (no SCM poster) storing verdict + body.
- http: restore POST /sessions/{id}/reviews/submit + SubmitReviewInput.
- cli: ao review submit [worker] --verdict --body (worker from arg/--session/$AO_REVIEW_WORKER).
- runner: reviewer prompt instructs posting to GitHub and recording via ao review submit.
Regenerated OpenAPI + TS types.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(review): move reviewer runner to its own package; sharpen prompt
Per PR #197 review:
- Move the concrete reviewer runner out of the service layer into a new
internal/review_runner package (package reviewrunner), beside other
orchestration packages like session_manager. The service keeps only the
Runner interface + RunSpec it depends on; the agent-resolver + runtime
launch flow lives in review_runner.
- Sharpen the reviewer prompt: tell the agent to diff against the PR base,
focus on high-confidence findings, post via `gh pr review`, and record
the result with `ao review submit`; review-only (no commits/edits).
- Add unit tests for the runner.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(review): simplify review_run schema; provider-agnostic reviewer prompt
Per PR #197 review:
- review_run: status default 'running' (drop 'pending'), drop CHECK
constraints on status/verdict, drop the updated_at column and the
session/iteration index. Propagated through queries, domain, store,
service, and tests.
- Reviewer prompt no longer hardcodes GitHub/gh commands — it instructs the
agent to use whatever review tooling the provider offers, keeping the
flow extensible across SCM providers.
Regenerated sqlc + OpenAPI/TS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(review): launch reviewer before persisting the run
Trigger now spawns the reviewer agent first and then writes the review_run
with a status derived from the launch outcome (running on success, failed
if it never started), instead of inserting a running row and correcting it
to failed afterwards.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(review): pluggable reviewer registry distinct from worker harnesses
Reviewers are now their own pluggable adapter set, separate from the worker
agent registry — adding a reviewer (claude-code today, greptile tomorrow) is
a one-line registration that does not widen the worker harness vocabulary,
and a worker harness does not automatically become a valid reviewer.
- domain.ReviewerHarness: a distinct vocabulary (AllReviewerHarnesses) with
its own IsKnown; ReviewerConfig/Review/ReviewRun use it. ResolveReviewerHarness
reuses the worker harness only when it is itself a supported reviewer, else
falls back to claude-code.
- ports.Reviewer: a reviewer-specific contract (ReviewCommand → argv + env)
that models one-shot / non-prompt CLIs natively instead of forcing every
reviewer through the worker's interactive GetLaunchCommand(Prompt:...).
- internal/adapters/reviewer: a separate registry + resolver (mirrors the
worker agent registry) with the claude-code reviewer adapter, which owns the
review prompt and reuses the worker claude-code launch construction.
- review_runner resolves via the reviewer registry (not the worker
AgentResolver) and merges AO_REVIEW_WORKER into the adapter's env.
- daemon wires the reviewer resolver. Registry/domain parity is test-enforced.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* test(review): cover run-scoped reviewer submit
* fix(api): update generated review submit schema
* refactor(review): split core engine (internal/review) from API service
Move the review orchestration (Trigger/Submit/List, run-id generation,
deps, RunSpec/Runner, sentinels) into a transport-independent core package
internal/review (Engine). internal/service/review is now a thin API-flow
boundary: the controller-facing Manager interface + a Service that delegates
to the engine + error re-exports.
This keeps the service layer to API concerns and lets the same engine back a
future in-process CLI trigger without going through HTTP. review_runner now
depends on the core package; daemon builds the engine and wraps it in the
service. No API/schema changes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* feat(review): commit-aware trigger, reviewer handle for UI, no env vars
Reworks the review trigger lifecycle and drops env-based coupling:
- review_run gains target_sha (the reviewed commit) and drops iteration.
A repeat trigger for the same PR head short-circuits to the existing run.
- review gains reviewer_handle_id: the live reviewer pane's runtime handle,
reused across passes and exposed in the reviews API so the UI can attach
its terminal over /mux.
- Trigger flow: if a live reviewer pane exists and a new commit arrived,
message it to re-review; otherwise spawn a fresh reviewer. The run is
recorded only after the reviewer is launched.
- No environment variables: the reviewer adapter embeds the explicit
`ao review submit --session <w> --run <id>` command in the spawn prompt
and the re-review message. CLI submit requires --run/--session (no env
fallbacks).
- Merge review_runner into internal/review as a Launcher (spawn/notify/alive).
- Trigger returns 201 for a new pass, 200 when reusing an existing run.
Regenerated sqlc + OpenAPI/TS.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(review): author the reviewer prompt centrally, not in the adapter
Mirror the worker model (session_manager builds the prompt; adapters just
place it via LaunchConfig.Prompt). The reviewer prompt now lives in
internal/review/prompt.go and is passed through ports.ReviewInvocation.Prompt;
the claude-code reviewer adapter just feeds inv.Prompt to its launch command
and returns it as the re-review message. One-shot CLI reviewers may ignore it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* refactor(review): split reviewer prompt into system+task, mirroring buildSpawnTexts
Mirror session_manager.buildSpawnTexts for the reviewer: a standing role goes
in the system prompt, the per-pass task (PR/commit + exact `ao review submit`
command) goes in the user prompt. internal/review/prompt.go now returns
(prompt, systemPrompt); both flow through ports.ReviewInvocation and the
claude-code adapter places them via LaunchConfig{Prompt, SystemPrompt}. The
re-review message reuses the per-pass prompt (role already established in the
running pane).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Vaibhaav <user@example.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Summary
Implements the pure DECIDE core for the LCM + Session Manager lane (PR #3), replacing the stubbed
panic(...)bodies inbackend/internal/domain/decide/decide.gowith real, total, side-effect-free deciders, plus exhaustive table tests.Deciders
ResolveProbeDecision— kill-intent short-circuits straight to terminalkilled; a failed probe (timeout/error) and probe disagreement (runtime/process mismatch, or runtime-dead with recent activity) route todetecting; only runtime-dead + process-dead + no-recent-activity concludeskilled. (Honors "failed probe ≠ dead" and "one authority for death".)ResolveOpenPRDecision— the PR pipeline ladder in exact order:ci_failing → changes_requested → approved+mergeable → approved → review_pending → idle-beyond → open.ResolveTerminalPRStateDecision— merged → idle /merged_waiting_decision; closed → idle.CreateDetectingDecision— anti-flap quarantine. Unchanged-evidence counter;StartedAtis preserved across the whole detecting episode so the duration cap is a real wall-clock safety net even when evidence flaps. Escalates tostuckatDetectingMaxAttempts(3) consecutive unchanged ticks orDetectingMaxDuration(5m).HashEvidence— strips ISO datetimes, bare time-of-day, and epoch numbers, then collapses whitespace before SHA-256, so restamped-but-unchanged signals compare equal.Notes for reviewers
domain/orports/; nodecideinput struct needed extending.LifecycleDecisioncarries no runtime sub-state fields, so the probe decider expresses runtime death viaSessionReason(runtime_lost/agent_process_exited) — the LCM is expected to translate that into the persisted runtime substate.DetectingInput.ProposedStateis kept for the contract but isn't load-bearing in the output (output state is alwaysdetectingor escalatedstuck); the reason is driven byProposedReason.Test plan
gofmt -l .prints nothinggo build ./...go vet ./...go test -race ./...internal/domain/decideStatusmatchesDeriveLegacyStatusover the canonical state it emits, and an end-to-end probe→detecting→stuck escalation flow🤖 Generated with Claude Code